Introduction

Use of the programming language R for data processing and statistical analysis by researchers is increasingly common (REF). In addition to benefiting reproducibility and transparency, one of the advantages of using R is that, due to its open-source nature, researchers have a much larger range of fully customisable data visualisation options than are typically available in point-and-click software. These visualisation options not only look attractive, but can increase transparency about the distribution of the underlying data rather than relying on commonly used visualisations of aggregations, such as bar charts of means (REF).

Yet, the benefits of using R are hindered by its notoriously steep learning curve (REF) and by the fact that only a minority of psychology programmes currently teach programming skills (REF), with the majority of both undergraduate and postgraduate courses using proprietary point-and-click software such as SPSS or Microsoft Excel.

In this paper we aim to provide a practical introduction to data visualisation using R, specifically aimed at researchers who have little to no prior experience of using R. We detail the rationale for using R for data visualisation, introduce the “grammar of graphics” that underlies data visualisation using the ggplot2 package, and provide a tutorial that walks the reader through how to replicate plots that are commonly available in point-and-click software, such as histograms and boxplots, as well as showing how the code for these “basic” plots can easily be extended to less commonly available options, such as violin-boxplots, raincloud plots, and heat-maps.

Why R for data visualisation?

Data visualisation benefits from the same advantages as statistical analysis when writing code rather than using point-and-click software – reproducibility and transparency. The need for psychological researchers to work in reproducible ways has been well-documented and discussed in response to the replication crisis (REFs) and we will not repeat these arguments here. However, there is an additional selfish benefit to reproducibility that is less frequently acknowledged compared to the loftier goals of improving psychological science: if you write code to produce your plots, future-you can reuse and adapt that code rather than starting from scratch each time.

In addition to the benefits of reproducibility, using R for data visualisation gives the researcher almost total control over each element of the plot. Whilst this flexibility can seem daunting to novice users of R, if one can survive the initial learning curve, the ability to write reusable code recipes (and use recipes created by others) is highly advantageous. The level of customisation and the professional outputs available using R have led news outlets such as the BBC (Visual and Journalism 2019) and the New York Times (Bertini and Stefaner 2015) to adopt R as their preferred data visualisation tool.

A layered grammar of graphics

There are multiple approaches to data visualisation in R; in this paper we will be using the popular package1 ggplot2 (Wickham 2016) which is part of the larger tidyverse2 (Wickham 2017) collection of packages that provide functions for data wrangling, descriptives, and visualisation. A grammar of graphics (Wilkinson, Anand, and Grossman 2005) is a standardised way to describe the components of a graphic. ggplot2 uses a layered grammar of graphics (Wickham 2010), in which plots are built up in a series of layers. This approach may be familiar to users of MATLAB but can be unintuitive to those used to creating plots in Excel or SPSS.

Figure 1 displays the evolution of a simple scatterplot using this layered approach. First, the plot space is built (layer 1); the variables are specified (layer 2); the type of visualisation (known as a geom) that is desired for these variables is specified (layer 3), in this case geom_point() is called to visualise individual data points; a second geom is added to include a line of best fit (layer 4); the axis labels are edited for readability (layer 5); and finally, a theme is applied to change the overall appearance of the plot (layer 6).

Evolution of a layered plot

Figure 1: Evolution of a layered plot

Importantly, each layer is independent and independently customisable. For example, the size, color, and position of each component can be adjusted, or one could remove the first geom to visualise only the line of best fit, simply by removing the layer that draws the data points (Figure 2). The use of layers makes it easy to build up complex plots step-by-step, and to adapt or extend plots from existing code.

Plot with scatterplot layer removed.

Figure 2: Plot with scatterplot layer removed.

In this tutorial, for each step, we will explain the purpose of the code, provide the code needed to achieve that aim, and provide a plain English translation of what each line of code is doing. For each function, the additional resources in the appendix provide links to further instruction.

The simulated dataset and all below code can be found in the online supplementary materials.

Preparing your data

Before you start visualizing your data, you need to get the data into an appropriate format. These preparatory steps can all be dealt with reproducibly using R and the additional resources section points to extra tutorials for doing so. However, performing these types of tasks in R can be difficult for new learners and the solutions and tools are dependent on the idiosyncrasies of each dataset. For this reason, in this tutorial we encourage the reader to complete these steps using the method they are most comfortable with and to focus on the aim of data visualisation.

Data format

The simulated Stroop data is provided in a csv file rather than, e.g., xlsx. Functions exist in R to read many other types of data files; however, you can convert an xlsx spreadsheet to csv by using the Save As function in Microsoft Excel. Note that csv files strip all formatting and only store data from a single sheet; if your data are part of a larger workbook, you may wish to create a csv file that only contains the data you wish to visualise. When working with your own data, any files you import should be free of summary rows or additional notes and should only contain the rows and columns you want to plot.

Variable names

Ensuring that your variable names are consistent can make it much easier to work in R. We recommend using short but informative variable names, for example rt_cong is preferred over dv1_iv1 or reaction_time_congruent_condition because these are either hard to read or hard to type.

It is also helpful to have a consistent naming scheme, particularly for variable names that require more than one word. Two popular options are CamelCase where each new word begins with a capital letter, or snake_case where all letters are lower case and words are separated by an underscore. For the purposes of naming variables, avoid using any spaces in variable names (e.g., rt cong) and consider the additional meaning of a separator beyond making the variable names easier to read. For example, rt_cong, rt_incon, acc_cong, and acc_incon all have the DV to the left of the separator and the level of the IV to the right. rt_cong_condition on the other hand has two separators but only one of them is meaningful and it is useful to be able to split variable names consistently. In this paper, we will use snake_case and lower case letters for all variable names so that we don’t have to remember where to put the capital letters.

When working with your own data, you can rename columns in Excel, but there are also instructions for how to do this in R in the additional resources.
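As a sketch of what renaming looks like in R (the original column names below are hypothetical examples, not from the tutorial dataset), the rename() function from the dplyr tidyverse package takes new_name = old_name pairs:

```r
# rename columns to a consistent snake_case scheme
# (the original names here are hypothetical examples)
dat <- dat %>%
  rename(rt_cong = ReactionTimeCongruent,
         rt_incon = ReactionTimeIncongruent)
```

The renamed columns can then be used consistently throughout the rest of your code.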

Data values

A great benefit to using R rather than SPSS is that categorical data can be entered as text. In the tutorial dataset, language group is entered as 1 or 2, so we can show you below how to recode numeric values into factors with labels, which also sets the order of display for graphs. However, we recommend recording meaningful labels rather than numbers to avoid misinterpreting data due to coding errors. Note that values must match exactly in order to be considered in the same category and R is case sensitive, so “mono,” “Mono,” and “monolingual” would be classified as members of three separate categories.

Finally, cells that represent missing data should be left empty rather than containing values like NA, missing or 0. A complementary rule of thumb is that each column should only contain one type of data, such as words or numbers, not both.
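As a small illustration of why empty cells are preferable (a sketch, not part of the tutorial dataset): read_csv() imports empty cells as NA, and most summary functions will return NA unless told to drop missing values.

```r
ages <- c(20, NA, 30)     # an empty csv cell becomes NA on import
mean(ages)                # returns NA
mean(ages, na.rm = TRUE)  # returns 25
```

If missing data were instead recorded as 0 or "missing", the first would silently distort the mean and the second would turn a numeric column into text.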

Setting up R and RStudio

We strongly encourage the use of RStudio to write code in R. R is the programming language whilst RStudio is an integrated development environment that makes working with R easier. More information on installing both R and RStudio can be found in the additional resources.

Projects are a useful way of keeping all your code, data, and output in one place. To create a new project, open RStudio and click File - New Project - New Directory - New Project. You will be prompted to give the project a name and to select a location on your computer in which to store the project. Once you have done this, click Create Project. Copy the dataset to this folder. The files pane on the bottom right of RStudio should now display this folder and the files it contains - this is known as your working directory and it is where R will look for any data you wish to import and where it will save any output you create.

This tutorial requires the packages contained within the tidyverse collection, as well as the package patchwork. To install these packages, copy and paste the below code into the console (the left-hand pane) and press enter to execute it.

# only run in the console, never put this in a script 
package_list <- c("tidyverse", "patchwork")
install.packages(package_list)

Finally, so that you can save your code to return to later, open a new script via File - New File - R Script and then save it using File - Save. R will default to saving the script in your project folder. This is where we will write all the tutorial code from now on. The install.packages() code is not included in the script because every time it runs it will install the latest version of the package; leaving it in your script can therefore lead you to unintentionally install a package update you didn’t want. For this reason, avoid including installation code in any script. R scripts treat anything you write in them as code by default. If you wish to write a comment with further information about your code (which we encourage you to do), you must use the hash sign (#) to “comment it out.”

this_is_code
# this is a comment

Loading packages

To load the packages that have the functions we need, use the library() function. Whilst you only need to install packages once, you need to load any packages you want to use with library() every time you start R. When you load the tidyverse, you actually load several separate packages that are all part of the same collection and have been designed to work well together. R will produce a message that tells you the names of all the packages that have been loaded.

library(tidyverse)
library(patchwork)

Loading data

For the purpose of this tutorial, we will use simulated data for a 2 x 2 mixed-design Stroop test experiment. There are 100 rows (1 for each subject) and 7 variables:

  • Participant information:
    • id: Participant ID
    • age: Age
  • 1 between-subject IV:
    • language: Language group (1 = monolingual/2 = bilingual)
  • 4 columns for the 2 dependent variables for RT and accuracy, crossed by the within-subject IV of condition:
    • rt_cong: Reaction time (ms) for congruent trials
    • rt_incon: Reaction time (ms) for incongruent trials
    • acc_cong: Accuracy for congruent trials
    • acc_incon: Accuracy for incongruent trials

To load the simulated data we use the function read_csv() from the readr tidyverse package.

dat <- read_csv(file = "stroop_data.csv")

This code has created an object dat that contains the data from the file stroop_data.csv. This object will appear in the environment pane in the top right. Note that the name of the data file must be in quotation marks and the file extension (.csv) must also be included. If you receive the error …does not exist in current working directory it is highly likely that you have made a typo in the file name (remember R is case sensitive), have forgotten to include the file extension, or that the data file you want to load is not stored in your project folder. If you get the error could not find function it means you have either not loaded the correct package, or you have made a typo in the function name.

To view the dataset, click dat in the environment pane or run View(dat) in the console. The environment pane also tells us that the object dat has 100 observations of 7 variables, and is a useful quick check to ensure one has loaded the right data.
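Another quick check (optional) is the glimpse() function, loaded as part of the tidyverse, which prints each column's name, type, and first few values. This is useful for confirming that numeric columns have not accidentally been imported as text.

```r
# prints each column's name, type, and first few values,
# e.g., id as character, age and language as numeric
glimpse(dat)
```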

Handling numeric factors

The factor language is coded as 1 and 2, but you probably want to use the labels “monolingual” and “bilingual” for tables and plots. The code below shows how to recode numeric codes into labels. The mutate() function makes new columns in a data table, or overwrites an existing column, and the factor() function translates the language column into a factor with the labels “monolingual” and “bilingual.” You can also use the factor() function to set the display order of a column that contains words; otherwise, the values will display in alphabetical order.

dat <- dat %>%
  mutate(language = factor(
    x = language, # column to translate
    levels = c(1, 2), # values of the original data in preferred order
    labels = c("monolingual", "bilingual") # labels for display
  ))

Argument names

Each function has a list of arguments it can take, and a default order for those arguments. You can get more information on each function by entering ?function_name into the console. When you are writing R code, as long as you stick to the default order, you do not have to explicitly call the argument names, for example, the above code could also be written as:

dat <- dat %>%
  mutate(language = factor(language, c(1, 2), c("monolingual", "bilingual")))

One of the barriers to learning R is that many of the “helpful” examples and solutions you will find online do not include argument names, and so are completely opaque to novice learners. In this tutorial, we will include the argument names the first time a function is used; however, we will remove some argument names from subsequent examples to facilitate knowledge transfer to the help available online.

Demographic information

We can calculate and plot some basic descriptive information about the demographics of our sample using the imported dataset without any additional wrangling. The code below uses the %>% operator, otherwise known as the pipe, which can usefully be translated as “and then.” For example, the below code can be read as:

  • Start with the object dat and then;

  • Group it by the variable language and then;

  • Count the number of observations in each group

dat %>%
  group_by(language) %>%
  count()
language n
monolingual 55
bilingual 45

This will produce the number of observations in each group of the variable language. If you just need the total number of observations, you could remove the group_by() line which would perform the operation on the whole dataset, rather than by groups:

dat %>%
  count()
n
100

Similarly, we may wish to calculate the mean age (and SD) of the sample and we can do so using the function summarise() from the dplyr tidyverse package.

dat %>%
  summarise(mean_age = mean(age),
            sd_age = sd(age))
mean_age sd_age
29.87 7.998302

This code produces summary data in the form of a column named mean_age that contains the result of calculating the mean of the variable age, and a column named sd_age that does the same for the standard deviation. Note that the above code will not save the result of this operation; it will simply output the result to the console. If you wish to save it for future use, you can store it in an object by using the <- notation and print it later by typing the object name.

age_stats <- dat %>%
  summarise(mean_age = mean(age),
            sd_age = sd(age))

Finally, the group_by() function will work in the same way when calculating summary statistics - the output of the function that is called after group_by() will be produced for each level of the grouping variable.

dat %>%
  group_by(language) %>%
  summarise(mean_age = mean(age),
            sd_age = sd(age))
language mean_age sd_age
monolingual 31.18182 7.303198
bilingual 28.26667 8.584870

Bar chart of counts

For our first plot, we will make a simple bar chart of counts that shows the number of participants in each language group.

ggplot(data = dat, mapping = aes(x = language)) +
  geom_bar()
Bar chart of counts.

Figure 3: Bar chart of counts.

The first line of code sets up the first layer and base of the plot.

  • data specifies which data source to use for the plot

  • mapping specifies which variables to map to which aesthetics (aes) of the plot

  • x specifies which variable to put on the x-axis

The second line of code draws the plot, and is connected to the base code with +. In this case, we ask for geom_bar(). Each geom, or type of plot, has an associated default statistic. For bar charts, the default statistic is to count the data passed to it. This means that you do not have to specify a y variable when making a bar plot of counts, when given an x variable, geom_bar() will automatically calculate counts of the groups in that variable. In this example, it counts the number of data points that are in each category of the language variable.

If your data already contain the counts that you want to plot, you can set stat = "identity" inside geom_bar() to use those numbers instead of counting rows. Notice that we are now omitting the names of the arguments data and mapping in the ggplot() function.

count_dat <- dat %>%
  group_by(language) %>%
  count()

ggplot(count_dat, aes(x = language, y = n)) +
  geom_bar(stat="identity")
Bar chart of pre-calculated counts.

Figure 4: Bar chart of pre-calculated counts.

Histogram

The code to plot a histogram of age is very similar to that for the bar chart. We start by setting up the plot space and the dataset we want to use, and by mapping the variables to the relevant axes. In this case, we want to plot a histogram with age on the x-axis:

ggplot(dat, aes(x = age)) +
  geom_histogram()
Histogram of ages.

Figure 5: Histogram of ages.

By default geom_histogram() divides the x-axis into “bins” and counts how many observations are in each bin and so the y-axis does not need to be specified - it is automatically a count. When you run the code to produce the histogram, you will get the message stat_bin() using bins = 30. Pick better value with binwidth. This means that the default number of bins geom_histogram() divides the x-axis into is 30, but we may wish to pick a more appropriate value for our dataset. If we want one bar to equal one year, we can adjust binwidth = 1.

ggplot(dat, aes(x = age)) +
  geom_histogram(binwidth = 1)
Histogram of ages where each bin covers one year.

Figure 6: Histogram of ages where each bin covers one year.

Visual improvements part 1

So far we have made basic plots with the default visual appearance. Before we move on to the experimental data, we will introduce some simple visual customisation options.

Editing axis names and labels

To edit axis names and labels we can use scale_ functions. There are a number of different scale functions and which one you use depends on which aesthetic you wish to edit (e.g., x, y, fill, color) and the type of data it represents (discrete, continuous).

For the bar chart of counts, the x-axis is mapped to a discrete (categorical) variable whilst the y-axis is continuous. Each axis has its own function, and its own layer:

ggplot(dat, aes(language)) +
  geom_bar() +
  scale_x_discrete(name = "Language group", 
                   labels = c("Bilingual", "Monolingual")) +
  scale_y_continuous(name = "Number of participants",
                     breaks = c(0,10,20,30,40,50))
Bar chart with custom axis labels.

Figure 7: Bar chart with custom axis labels.

  • name controls the overall name of the axis

  • labels controls the names of the conditions of a discrete variable. Note that the conditions are combined within c(), are each enclosed within their own quotation marks, and are listed in the order in which they are displayed on the plot.

  • breaks controls the tick marks on the axis.
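As a small aside, rather than typing out every break, the base R function seq() can generate the same regularly spaced vector, which is less error-prone for axes with many tick marks:

```r
# generates regularly spaced breaks,
# equivalent to c(0, 10, 20, 30, 40, 50)
seq(from = 0, to = 50, by = 10)
```

This could be passed directly to breaks, i.e., breaks = seq(0, 50, 10).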

Changing colors

You can control the colors used to display the bars by setting fill (internal color) and color (outline color) inside the geom function. This method changes all of the bars at once; we will show you later how to set fill or color separately for different groups.

ggplot(dat, aes(age)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black") +
  scale_x_continuous(name = "Participant age (years)")
Histogram with custom colors for bar fill and line colors.

Figure 8: Histogram with custom colors for bar fill and line colors.

Adding a theme

ggplot has a number of built-in visual themes that you can apply as an extra layer. The below code updates the x-axis label to the histogram, but also applies theme_minimal(). Each part of a theme can be independently customised, which may be necessary, for example, if you have journal guidelines on fonts for publication. There are further instructions for how to do this in the additional resources.

ggplot(dat, aes(age)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black") +
  scale_x_continuous(name = "Participant age (years)") +
  theme_minimal()
Histogram with a custom theme.

Figure 9: Histogram with a custom theme.

You can also set the theme globally so that all subsequent plots use that theme.

theme_set(theme_minimal())

Activities 1

Before you move on try the following:

  • Add a layer that changes the y-axis label of the histogram to Number of participants.

  • Change the color of the bars in the bar chart to red.

  • Remove theme_minimal() and instead apply one of the other available themes (start typing theme_ and the auto-complete will show you the available options).

Data format

Traditionally, wide-format data is the preferred or default format. Wide-format data typically has one row of data for each participant with separate columns for each score or variable. Where there are repeated-measures variables, the dependent variable is split across different columns with one measurement for each condition.

The simulated Stroop data is currently in wide-format (see Table 1) where each participant’s aggregated3 reaction time and accuracy for each level of the within-subject variable is split across multiple columns.

Table 1: Data in wide format.
id age language rt_cong rt_incon acc_cong acc_incon
S001 25 monolingual 369.46 666.82 99 90
S002 37 monolingual 302.45 585.04 94 82
S003 26 monolingual 394.94 608.50 96 87
S004 32 monolingual 288.37 485.89 92 76
S005 28 monolingual 306.42 551.32 91 83
S006 34 monolingual 347.17 517.34 96 78

This format is popular because it is intuitive to read and easy to enter data into, as all the data for one participant are contained within a single row. However, for the purposes of analysis, and particularly for analysis using R, this format is unsuitable: whilst it is intuitive for a human to read, the same is not true for a computer. Wide-format data combines multiple variables in a single cell; for example, in Table 1, rt_cong contains information related to both a DV and the level of an IV. Wickham (2014) provides a comprehensive overview of the benefits of tidy data as a standard way of mapping a dataset to its structure, but for the purposes of this tutorial there are two important rules: each column should be a variable and each row should be an observation.

Moving from using wide-form to long-form datasets can require a conceptual shift on the part of the researcher and one that usually only comes with practice and repeated exposure4. For our example dataset, adhering to these rules would produce Table 2. Rather than different observations of the same dependent variable being split across columns, there is now a single column for the DV reaction time, and a single column for the DV accuracy. Each participant now has multiple rows of data, one for each observation (i.e., for each participant there will be as many rows as there are levels of the within-subject IV). Although there is some repetition of age and language group, each row is unique when looking at the combination of measures.

Table 2: Data in the correct format for visualization.
id age language condition rt acc
S001 25 monolingual cong 369.46 99
S001 25 monolingual incon 666.82 90
S002 37 monolingual cong 302.45 94
S002 37 monolingual incon 585.04 82
S003 26 monolingual cong 394.94 96
S003 26 monolingual incon 608.50 87

The benefits and flexibility of this format will (hopefully) become apparent as we progress through the tutorial, however, a useful rule of thumb when working with data in R for visualisation is that anything that shares an axis should probably be in the same column. For example, a simple bar chart of means for the reaction time DV would display the variable condition on the x-axis with bars representing both the cong and incon data, therefore, these data should be in one column and not split.

Transforming data

We have chosen a 2 x 2 design with two DVs as we anticipate that this is a design many researchers will be familiar with and may also have existing datasets with a similar structure. However, it is worth highlighting that trial-and-error is part of the process of learning how to apply these functions to new datasets and structures. Data visualisation can be a useful way to scaffold learning these data transformations because they can provide a concrete visual check as to whether you have done what you intended to do with your data.

Step 1: pivot_longer()

The first step is to use the function pivot_longer() to transform the data to long-form. The pivot functions can be easier to show than to tell - you may find it a useful exercise to run the below code and compare the newly created object long (Table 3) with the original dat (Table 1) before reading on.

long <- pivot_longer(data = dat, 
                     cols = rt_cong:acc_incon, 
                     names_sep = "_", 
                     names_to = c("dv_type", "condition"),
                     values_to = "dv")
  • As with the other tidyverse functions, the first argument specifies the dataset to use as the base, in this case dat. This argument name is often dropped in examples.

  • cols specifies all the columns you want to transform. The easiest way to visualise this is to think about which columns would be the same in the new long-form dataset and which will change. If you refer back to Table 1 and 2, you can see that id, age, and language all remain, it is the columns that contain the measurements of the DVs that change. Therefore cols specifies that the columns we want to transform are rt_cong to acc_incon which will select those four columns.

  • names_sep specifies how to split up the variable name in cases where it has multiple components. This is when taking care to name your variables consistently and meaningfully pays off. Because the word to the left of the separator (_) is always the DV type and the word to the right is always the condition of the within-subject IV, it is easy to automatically split the columns.

  • names_to specifies the names of the new columns that will be created. There are two: one for the text to the left of the separator, and one for the text to the right of the separator.

  • Finally, values_to names the new column that will contain the measurements, in this case we’ll call it dv. At this point you may find it helpful to go back and compare dat and long again to see how each argument matches up with the output of the table.

Table 3: Data in long format.
id age language dv_type condition dv
S001 25 monolingual rt cong 369.46
S001 25 monolingual rt incon 666.82
S001 25 monolingual acc cong 99.00
S001 25 monolingual acc incon 90.00
S002 37 monolingual rt cong 302.45
S002 37 monolingual rt incon 585.04

Step 2: pivot_wider()

Because we have two DVs, we need to perform an additional step. In the current long-form dataset, the column dv contains both reaction time and accuracy measures; keeping in mind the rule of thumb that anything that shares an axis should probably be in the same column, this creates a problem, because we cannot plot two different units of measurement on the same axis. To fix this we need to use the function pivot_wider(). Again, we would encourage you at this point to run the below code and compare long and dat_long to try to map the connections before reading on.

dat_long <- pivot_wider(long, 
                        names_from = "dv_type", 
                        values_from = "dv")
  • The first argument is again the dataset you wish to work from, in this case long. We have removed the argument name data in this example.

  • names_from acts somewhat like the reverse of names_to from pivot_longer(). It will take the values from the variable specified and use these as variable names, i.e., in this case, the values of rt and acc that are currently in the dv_type column, and turn these into the column names.

  • Finally, values_from specifies the values to fill the new columns with. In this case, the new columns rt and acc will be filled with the values that were in dv. Again, it can be helpful to compare each dataset with the code to see how it aligns.

This final long-form data should look as follows:

Table 4: Data in the format required for visualization.
id age language condition rt acc
S001 25 monolingual cong 369.46 99
S001 25 monolingual incon 666.82 90
S002 37 monolingual cong 302.45 94
S002 37 monolingual incon 585.04 82
S003 26 monolingual cong 394.94 96
S003 26 monolingual incon 608.50 87

We have purposefully used a more complex dataset with two DVs for this tutorial. If you are working with a dataset with only one DV, note that only step 1 of this process would be necessary, potentially with the removal of the names_sep argument. Finally, be careful not to calculate demographic descriptive statistics from this long-form dataset. Because the process of transformation has introduced some repetition for these variables, the wide-form dataset where 1 row = 1 participant should be used for demographic information.
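To make this warning concrete, here is a sketch of the double-counting using count() (a dplyr shorthand for group_by() followed by counting):

```r
# dat_long has one row per participant per condition,
# so each participant is counted twice
dat %>% count(language)      # monolingual 55, bilingual 45
dat_long %>% count(language) # monolingual 110, bilingual 90
```

The group sizes from the wide-form data match the true sample, whilst those from the long-form data are inflated by the number of within-subject conditions.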

Histogram 2

Now that we have the experimental data in the right form, we can begin to create some useful visualizations. First, to demonstrate how code recipes can be reused and adapted, we will create histograms of reaction time and accuracy. The below code uses the same template as before but changes the dataset (dat_long), the binwidths of the histograms, the x variable to display (rt/acc), and the name of the x-axis.

ggplot(dat_long, aes(x = rt)) +
  geom_histogram(binwidth = 10, fill = "white", color = "black") +
  scale_x_continuous(name = "Reaction time (ms)")
Histogram of reaction times.

Figure 10: Histogram of reaction times.

ggplot(dat_long, aes(x = acc)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black") +
  scale_x_continuous(name = "Accuracy (0-100)")
Histogram of accuracy scores.

Figure 11: Histogram of accuracy scores.

Density plots

The layer system makes it easy to create new types of plots by adapting existing recipes. For example, rather than creating a histogram, we can create a smoothed density plot by calling geom_density() rather than geom_histogram(). The rest of the code remains identical.

ggplot(dat_long, aes(x = rt)) +
  geom_density() +
  scale_x_continuous(name = "Reaction time (ms)")

Figure 12: Density plot of reaction time.

Grouped density plots

Density plots are most useful for comparing the distributions of different groups of data. Because the dataset is now in long format, it is easier to map another variable onto the plot. In addition to mapping rt to the x-axis, we specify the fill aesthetic to fill the visualisation of each level of the condition variable with a different color. Just like with the x- and y-axis scale functions, we can edit the name and labels of the fill aesthetic by adding another scale_*_*() layer.

ggplot(dat_long, aes(x = rt, fill = condition)) +
  geom_density() +
  scale_x_continuous(name = "Reaction time (ms)") +
  scale_fill_discrete(name = "Condition",
                      labels = c("Congruent", "Incongruent"))

Figure 13: Density plot of reaction times grouped by condition.

Note that fill here is set inside the aes() function, which tells ggplot2 to fill each level of the condition variable with a different color. You cannot specify a particular color here (e.g., fill = "red") as you could when setting fill inside the geom_*() function earlier.
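To make the contrast concrete, here is a minimal sketch (assuming the dat_long data from earlier) showing fill mapped inside aes() versus fill fixed inside the geom:

```r
library(ggplot2)

# fill mapped inside aes(): one color per level of condition, chosen
# automatically by ggplot, with a legend added
ggplot(dat_long, aes(x = rt, fill = condition)) +
  geom_density(alpha = .5)

# fill set inside the geom: a single fixed color for the whole layer,
# and no legend is produced
ggplot(dat_long, aes(x = rt)) +
  geom_density(fill = "red")
```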

Scatterplots

Scatterplots are created by calling geom_point() and require both an x and y variable to be specified in the mapping.

ggplot(dat_long, aes(x = rt, y = age)) +
  geom_point()

Figure 14: Point plot of reaction time versus age.

A line of best fit can be added with an additional layer that calls the function geom_smooth(). The default is to draw a LOESS (curved) regression line; however, a linear line of best fit can be specified using method = "lm". By default, geom_smooth() will also draw a confidence envelope around the regression line; this can be removed by adding se = FALSE to geom_smooth().

ggplot(dat_long, aes(x = rt, y = age)) +
  geom_point() +
  geom_smooth(method = "lm")

Figure 15: Line of best fit for reaction time versus age.

Grouped scatterplots

Similar to the density plot, the scatterplot can also be easily adjusted to display grouped data. For geom_point(), the grouping variable is mapped to color rather than fill and the relevant scale_*_*() function is added.

ggplot(dat_long, aes(x = rt, y = age, color = condition)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_discrete(name = "Condition",
                       labels = c("Congruent", "Incongruent"))

Figure 16: Grouped scatter plot of reaction time versus age by condition.

Visual improvements 2

Accessible color schemes

One of the drawbacks of using ggplot for visualisation is that the default color scheme is not accessible (or visually appealing). The default red and green palette is difficult for color-blind people to differentiate, and also does not display well in greyscale. You can specify exact custom colors for your plots, but one easy option is to use the viridis scale functions. These take the same arguments as their default sister functions for updating axis names and labels, but display plots in contrasting colors that can be read by color-blind people and that also print well in greyscale. The viridis scale functions provide a number of different color options; try setting option to any letter from "A" to "E" to see the different palettes.

ggplot(dat_long, aes(x = rt, y = age, color = condition)) +
  geom_point() +
  geom_smooth(method = "lm") +
  # use "viridis_d" instead of "discrete" for better colors
  scale_color_viridis_d(name = "Condition",
                        labels = c("Congruent", "Incongruent"),
                        option = "E")

Figure 17: Use the viridis color scheme for accessibility.

Activities 2

Before you move on try the following:

  • Use fill to create grouped histograms that display the distributions of rt for each language group separately, and edit the labels of the fill legend. Try adding position = "dodge" to geom_histogram() to see what happens.

  • Use scale_*_*() functions to edit the names of the x- and y-axes on the scatterplot.

  • Use se = FALSE to remove the confidence envelope from the scatterplots.

  • Remove method = "lm" from geom_smooth() to produce a curved regression line.

  • Replace the default scale_fill_*() on the grouped density plot with the color-blind friendly version.

Boxplots

As with geom_point(), boxplots also require an x- and y-variable to be specified. In this case, x must be a discrete (categorical) variable, whilst y must be continuous.

ggplot(dat_long, aes(x = condition, y = acc)) +
  geom_boxplot()

Figure 18: Basic boxplot.

Grouped boxplots

As with histograms and density plots, fill can be used to create grouped boxplots:

ggplot(dat_long, aes(x = condition, y = acc, fill = language)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "D",
                       name = "Group",
                       labels = c("Bilingual", "Monolingual")) +
  theme_classic() +
  scale_x_discrete(name = "Condition",
                   labels = c("Congruent", "Incongruent")) +
  scale_y_continuous(name = "Accuracy")
Figure 19: Grouped boxplots of accuracy by condition and language group.

Violin plots

Violin plots display the distribution of a dataset and can be created by calling geom_violin(). They are so-called because their shape sometimes resembles a violin; essentially, they are mirrored density plots turned on their side. Note that the below code is identical to the code used to draw the boxplots above, except for the call to geom_violin() rather than geom_boxplot().

ggplot(dat_long, aes(x = condition, y = acc, fill = language)) +
  geom_violin() +
  scale_fill_viridis_d(option = "D",
                       name = "Group",
                       labels = c("Bilingual", "Monolingual")) +
  theme_classic() +
  scale_x_discrete(name = "Condition",
                   labels = c("Congruent", "Incongruent")) +
  scale_y_continuous(name = "Accuracy")

Figure 20: Violin plot.
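As a small extension beyond the tutorial code (the argument is part of ggplot2, not something used above): geom_violin() also accepts density-related arguments, for example trim = FALSE, which draws the full tails of the estimated distributions rather than truncating the violins at the range of the observed data. A sketch, assuming the same dat_long data:

```r
library(ggplot2)

# Untrimmed violins: tails extend past the observed data range
ggplot(dat_long, aes(x = condition, y = acc, fill = language)) +
  geom_violin(trim = FALSE)
```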

Bar chart of means

Commonly, rather than visualising distributions of raw data, researchers will wish to visualise means using a bar chart with error bars. Although this is one of the most common data visualisations, it is perhaps the least intuitive for novice learners of R to achieve in ggplot. We present this code here because it is a common visualisation; however, we would urge you to use a better visualisation that provides more transparency about the distribution of the raw data, such as the violin-boxplots we will present in the next section.

Rather than calling a specified geom, we call stat_summary().

  • fun specifies a summary function that gives us the y-value we want to plot; in this case, mean

  • geom specifies what shape or plot type we want to use to display the summary. For the first layer we will specify "bar".

ggplot(dat_long, aes(x = condition, y = rt)) +
  stat_summary(fun = "mean", geom = "bar")

Figure 21: Bar plot of means.

To add the error bars, another layer is added with a second call to stat_summary(). This time, the function represents the type of error bars we wish to draw: you can choose from mean_se for standard error, mean_cl_normal for 95% confidence intervals, or mean_sdl for standard deviation (the latter two require the Hmisc package). width controls the width of the error bars; try changing the value to see what happens.

  • fun.data specifies a summary function that gives us the y-value we want to plot plus its minimum and maximum values; in this case, mean_se

ggplot(dat_long, aes(x = condition, y = rt)) +
  stat_summary(fun = "mean", geom = "bar") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .2)

Figure 22: Bar plot of means with error bars representing SE.
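For example, standard-deviation error bars can be drawn by swapping in mean_sdl. This sketch assumes the same dat_long data; note that mean_sdl relies on the Hmisc package and defaults to 2 standard deviations, so fun.args is used here to request 1:

```r
library(ggplot2)

# Error bars showing the mean +/- 1 SD (mean_sdl defaults to 2 SDs)
ggplot(dat_long, aes(x = condition, y = rt)) +
  stat_summary(fun = "mean", geom = "bar") +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1),
               geom = "errorbar", width = .2)
```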

Violin-boxplot

The power of the layered system is further highlighted by the ability to combine different types of plots. For example, rather than using a bar chart with error bars, one can easily create a single plot that includes a violin plot, boxplot, and the mean with error bars. This plot requires just two more lines of code than the bar plot with error bars, yet the amount of information displayed is vastly superior.

ggplot(dat_long, aes(x = condition, y = rt)) +
  geom_violin() +
  # remove the median line with fatten = NULL
  geom_boxplot(width = .2, fatten = NULL) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1)

Figure 23: Violin-boxplot with mean dot and standard error bars.

It is important to note that the order of the layers matters. For example, if we call geom_boxplot() followed by geom_violin(), we get the following mess:

ggplot(dat_long, aes(x = condition, y = rt)) +
  geom_boxplot() +  
  geom_violin() +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1)

Figure 24: Plot with the geoms in the wrong order.

Grouped violin-boxplots

As with previous plots, another variable can be mapped to fill for the violin-boxplot, however, simply adding fill to the mapping causes the different components of the plot to become misaligned because they have different default positions:

ggplot(dat_long, aes(x = condition, y = rt, fill = language)) +
  geom_violin() +
  geom_boxplot(width = .2, fatten = NULL) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1)

Figure 25: Grouped violin-boxplots without repositioning.

To rectify this we need to adjust the position argument for each of the misaligned layers. position_dodge() instructs R to move (dodge) the plot component by the specified amount; finding the value you need can take some trial and error.

ggplot(dat_long, aes(x = condition, y = rt, fill = language)) +
  geom_violin() +
  geom_boxplot(width = .2, fatten = NULL, position = position_dodge(.9)) +
  stat_summary(fun = "mean", geom = "point", 
               position = position_dodge(.9)) +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1,
               position = position_dodge(.9))

Figure 26: Grouped violin-boxplots with repositioning.

Visual improvements 3

Combining multiple types of plots can present an issue with the colors, particularly when the viridis scheme is used: in the below example it is hard to make out the black lines of the boxplot and the mean/error bars.

ggplot(dat_long, aes(x = condition, y = rt, fill = language)) +
  geom_violin() +
  geom_boxplot(width = .2, fatten = NULL, position = position_dodge(.9)) +
  stat_summary(fun = "mean", geom = "point", 
               position = position_dodge(.9)) +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1,
               position = position_dodge(.9)) +
  scale_fill_viridis_d(option = "E") +
  theme_minimal()

Figure 27: A color scheme that makes lines difficult to see.

There are a number of solutions to this problem. First, we can change the color of individual geoms by adding a fixed color (e.g., color = "grey") to each relevant geom:

ggplot(dat_long, aes(x = condition, y = rt, fill = condition)) +
  geom_violin() +
  geom_boxplot(width = .2, fatten = NULL, color = "grey") +
  stat_summary(fun = "mean", geom = "point", color = "grey") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1, color = "grey") +
  scale_fill_viridis_d(option = "E") +
  theme_minimal()

Figure 28: Manually changing the line colors.

We can also keep the original colors but adjust the transparency of each layer using alpha. Again, the exact values needed can take trial and error:

ggplot(dat_long, aes(x = condition, y = rt, fill = condition)) +
  geom_violin(alpha = .4) +
  geom_boxplot(width = .2, fatten = NULL, alpha = .5) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1) +
  scale_fill_viridis_d(option = "E") +
  theme_minimal()

Figure 29: Using transparency on the fill color.

Activities 3

Before you go on, do the following:

  • Review all the code you have run so far. Try to identify the commonalities between each plot’s code and the bits of the code you might change if you were using a different dataset.

  • Take a moment to recognise the complexity of the code you are now able to read.

  • In stat_summary() for geom = "point", try changing fun to "median".

  • In stat_summary() for geom = "errorbar", try changing fun.data to "mean_cl_normal" (for a 95% CI).

  • Go back to the grouped density plots and try changing the transparency with alpha.

Interaction plots

Interaction plots are commonly used to help display or interpret a factorial design. Just like with the bar chart of means, interaction plots represent data summaries and so they are built up with a series of calls to stat_summary().

  • shape acts much like fill in previous plots, except that rather than producing different color fills for each level of the IV, the data points are given different shapes.

  • size lets you change the size of lines and points. You usually don’t want different groups to be different sizes, so this option is set inside the relevant geom_*() function, not inside the aes() function.

  • scale_color_manual() works much like scale_color_discrete() except that it lets you specify the color values manually.

ggplot(dat_long, aes(x = condition, y = rt, 
                     shape = language,
                     group = language,
                     color = language)) +
  stat_summary(fun = "mean", geom = "point", size = 3) +
  stat_summary(fun = "mean", geom = "line") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .2) +
  scale_color_manual(values = c("blue", "darkorange")) +
  theme_classic()

Figure 30: Interaction plot.

ggplot will scale the axis start and end points to the data. If you wish to adjust these, you can use the scale_*_*() functions; for example, the below code specifies that the limits (start and end points) of the y-axis should be 0-750 ms.

ggplot(dat_long, aes(x = condition, y = rt, 
                     group = language,
                     color = language)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun = "mean", geom = "line") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .2) +
  scale_color_manual(values = c("blue", "darkorange")) +
  theme_classic() +
  scale_y_continuous(limits = c(0, 750))
Figure 31: Interaction plot with the y-axis limits set to 0-750 ms.
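One caution about scale limits (a general ggplot2 behaviour rather than something specific to this tutorial): scale_y_continuous(limits = ...) removes observations outside the range before summary statistics are computed, which can silently change means and error bars. To zoom the view without dropping any data, coord_cartesian() can be used instead; a sketch assuming the same dat_long data:

```r
library(ggplot2)

# Zoom the y-axis to 0-750 ms without discarding any observations
ggplot(dat_long, aes(x = condition, y = rt,
                     group = language, color = language)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun = "mean", geom = "line") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .2) +
  scale_color_manual(values = c("blue", "darkorange")) +
  theme_classic() +
  coord_cartesian(ylim = c(0, 750))
```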


Additional resources

References


  1. The power of R is that it is extendable and open source - put simply, if a function doesn’t exist or is difficult to use, anyone can create a new package that contains data and code to allow you to perform new tasks. You may find it helpful to think of packages as additional apps that you need to download separately to extend the functionality beyond what comes with “Base R.”↩︎

  2. Because there are so many different ways to achieve the same thing in R, when Googling for help with R, it is useful to append the name of the package or approach you are using, e.g., “how to make a histogram ggplot2.”↩︎

  3. In this tutorial we have chosen to gloss over the data wrangling that must occur to get from the raw data to these aggregated values. This type of wrangling requires a more extensive tutorial than this paper can provide but, more importantly, it is still possible to use R for data visualisation having done these preparatory steps using existing workflows that newcomers to R are comfortable with. The aim of this paper is to bypass these initial, often problematic steps and focus on tangible outputs that may then encourage further mastery of reproducible methods.↩︎

  4. That is to say, if you are new to R, know that many before you have struggled with this conceptual shift - it does get better, it just takes time and your preferred choice of cursing.↩︎